Computer Vision and Pattern Recognition 8
♻ ☆ MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning NeurIPS 2024
Video causal reasoning aims to achieve a high-level understanding of videos
from a causal perspective. However, existing work is limited in scope: it is
primarily cast as question answering over brief video segments containing
isolated events and basic causal relations, and lacks comprehensive, structured
causality analysis for videos with multiple interconnected events. To fill this
gap, we introduce a new task and dataset,
Multi-Event Causal Discovery (MECD). It aims to uncover the causal relations
between events distributed chronologically across long videos. Given visual
segments and textual descriptions of events, MECD identifies the causal
associations between these events to derive a comprehensive and structured
event-level video causal graph explaining why and how the result event
occurred. To address the challenges of MECD, we devise a novel framework
inspired by the Granger Causality method, incorporating an efficient mask-based
event prediction model to perform an Event Granger Test. It estimates causality
by comparing the predicted result event when premise events are masked versus
unmasked. Furthermore, we integrate causal inference techniques such as
front-door adjustment and counterfactual inference to mitigate challenges in
MECD like causality confounding and illusory causality. Additionally, context
chain reasoning is introduced to conduct more robust and generalized reasoning.
Experiments validate the effectiveness of our framework in reasoning complete
causal relations, outperforming GPT-4o and VideoChat2 by 5.77% and 2.70%,
respectively. Further experiments demonstrate that causal relation graphs can
also contribute to downstream video understanding tasks such as video question
answering and video event prediction.
comment: IEEE TPAMI Submission. Extended version of arXiv:2409.17647 (NeurIPS
2024)
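The Event Granger Test described above can be sketched as follows. This is a minimal illustration of the masking idea, assuming a hypothetical `predict_result` interface; the toy keyword predictor stands in for the paper's mask-based event prediction model and is not the authors' formulation.

```python
# Sketch of an Event Granger Test: a premise event is deemed causal if
# masking it degrades prediction of the result event (last event).
def granger_causality_score(events, premise_idx, predict_result):
    """Compare result-event prediction with premise unmasked vs masked."""
    full = predict_result(events)            # all premise events visible
    masked = list(events)
    masked[premise_idx] = "[MASK]"           # hide one premise event
    ablated = predict_result(masked)
    # Larger drop in prediction quality => stronger causal association.
    return full - ablated

# Toy predictor (purely illustrative): counts how many keywords tied to
# the result event appear in the visible premise events.
def toy_predictor(events):
    result_keywords = {"fire", "smoke"}
    visible = " ".join(e for e in events[:-1] if e != "[MASK]")
    return sum(1 for w in result_keywords if w in visible)

events = ["a spark ignites a fire", "people chat", "smoke fills the room"]
assert granger_causality_score(events, 0, toy_predictor) > 0   # causal premise
assert granger_causality_score(events, 1, toy_predictor) == 0  # non-causal
```

Masking each premise in turn and thresholding the score yields the edges of an event-level causal graph.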
♻ ☆ Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models
Large Vision-Language Models (LVLMs) have achieved remarkable success in a
wide range of multimodal tasks by integrating pre-trained vision encoders and
large language models. However, current LVLMs primarily rely on visual features
extracted from the final layers of the vision encoder, overlooking the
complementary information available in shallower layers. While recent
approaches have explored the use of multi-layer visual features in LVLMs, they
tend to be task-agnostic and fail to examine the dependencies of hierarchical
visual features on specific tasks. To address these gaps, we systematically
investigate the contributions of visual features from different encoder layers
using 18 benchmarks spanning 6 task categories. Our findings reveal that
multi-layer features provide complementary strengths with varying task
dependencies, and uniform fusion leads to suboptimal performance. Building on
these insights, we propose the instruction-guided vision aggregator, a module
that dynamically integrates multi-layer visual features based on textual
instructions, without increasing the number of visual tokens. Extensive
evaluations demonstrate the superior performance of our method. Additionally,
an in-depth analysis of the aggregator's behavior highlights the dominance of
mid-to-high-level features in semantic-rich tasks and the critical role of
low-level features in fine-grained perception.
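The aggregation step described above, instruction-conditioned weighting of encoder layers without adding visual tokens, can be sketched as follows. The gating architecture and all dimensions are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class InstructionGuidedAggregator(nn.Module):
    """Sketch: turn an instruction embedding into per-layer weights,
    then take a weighted sum over the encoder's layer outputs, so the
    number of visual tokens is unchanged."""
    def __init__(self, num_layers, text_dim):
        super().__init__()
        self.gate = nn.Linear(text_dim, num_layers)  # instruction -> layer logits

    def forward(self, layer_feats, instruction_emb):
        # layer_feats: (num_layers, batch, tokens, dim)
        weights = torch.softmax(self.gate(instruction_emb), dim=-1)  # (batch, L)
        weights = weights.permute(1, 0)[:, :, None, None]            # (L, batch, 1, 1)
        return (weights * layer_feats).sum(dim=0)                    # (batch, tokens, dim)

agg = InstructionGuidedAggregator(num_layers=4, text_dim=8)
feats = torch.randn(4, 2, 16, 32)   # 4 layers, batch 2, 16 tokens, dim 32
instr = torch.randn(2, 8)
fused = agg(feats, instr)
assert fused.shape == (2, 16, 32)   # token count is unchanged
```

Because fusion happens across the layer axis rather than the token axis, downstream LLM cost stays constant while the layer mix adapts per instruction.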
♻ ☆ IOR: Inversed Objects Replay for Incremental Object Detection
Existing Incremental Object Detection (IOD) methods partially alleviate
catastrophic forgetting when incrementally detecting new objects in real-world
scenarios. However, many of these methods rely on the assumption that unlabeled
old-class objects may co-occur with labeled new-class objects in the
incremental data. When unlabeled old-class objects are absent, the performance
of existing methods tends to degrade. The absence can be mitigated by
generating old-class samples, but doing so incurs high costs. This paper argues
that previous generation-based IOD methods suffer from redundancy, both in the use of
generative models, which require additional training and storage, and in the
overproduction of generated samples, many of which do not contribute
significantly to performance improvements. To eliminate the redundancy, we
propose Inversed Objects Replay (IOR). Specifically, we generate old-class
samples by inverting the original detectors, eliminating the need to train and
store additional generative models. We propose augmented replay
to reuse the objects in generated samples, reducing redundant generations.
Moreover, we propose high-value knowledge distillation focusing on the
positions of old-class objects overwhelmed by the background, which transfers
the knowledge to the incremental detector. Extensive experiments conducted on
MS COCO 2017 demonstrate that our method efficiently improves detection
performance in IOD scenarios in the absence of old-class objects.
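The detector-inversion idea above can be sketched as gradient ascent on a frozen model's class score, so no separate generative model is trained or stored. The "detector" below is a stand-in linear scoring function, not an actual detection head, and all hyperparameters are illustrative.

```python
import torch

def invert_for_class(score_fn, class_idx, shape=(1, 3, 64, 64), steps=50, lr=0.1):
    """Synthesize a sample of an old class by maximizing the frozen
    model's score for that class with respect to the input."""
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -score_fn(x)[0, class_idx]   # ascend the old-class score
        loss.backward()
        opt.step()
    return x.detach()

# Stand-in "detector": a fixed linear map from mean channel intensities
# to 5 class scores (purely illustrative).
torch.manual_seed(0)
W = torch.randn(3, 5)
def score_fn(x):
    return x.mean(dim=(2, 3)) @ W

sample = invert_for_class(score_fn, class_idx=2)
assert score_fn(sample)[0, 2] > 0.0   # inverted sample scores high for class 2
```

In practice one would add image priors (e.g. smoothness regularizers) so inverted samples resemble natural images, but the replay principle is the same.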
♻ ☆ Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding
Kohei Torimi, Ryosuke Yamada, Daichi Otsuka, Kensho Hara, Yuki M. Asano, Hirokatsu Kataoka, Yoshimitsu Aoki
Zero-shot recognition models require extensive training data for
generalization. However, in zero-shot 3D classification, collecting 3D data and
captions is costly and labor-intensive, posing a significant barrier compared to
2D vision. Recent advances in generative models have achieved unprecedented
realism in synthetic data production, and recent research shows the potential
for using generated data as training data. This naturally raises the question:
can synthetic 3D data produced by generative models be used to expand limited
3D datasets? In response, we present a synthetic 3D dataset expansion method,
Text-guided Geometric Augmentation (TeGA). TeGA is tailored for
language-image-3D pretraining, which achieves SoTA in zero-shot 3D
classification, and uses a generative text-to-3D model to enhance and extend
limited 3D datasets. Specifically, we automatically generate text-guided
synthetic 3D data and introduce a consistency filtering strategy to discard
noisy samples whose semantics and geometric shapes do not match the text. In an
experiment doubling the original dataset size with TeGA, our approach
demonstrates improvements over the baselines, achieving zero-shot performance
gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40.
These results demonstrate that TeGA effectively bridges the 3D data gap,
enabling robust zero-shot 3D classification even with limited real training
data and paving the way for zero-shot 3D vision applications.
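The consistency-filtering step above can be sketched as a similarity threshold between text and shape embeddings. The abstract does not specify the exact criterion; a cosine-similarity cutoff in a shared embedding space is an assumption here, and the embeddings below are toy vectors rather than outputs of a language-image-3D encoder.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_consistent(samples, threshold=0.5):
    """samples: list of (text_emb, shape_emb) pairs for generated assets.
    Keep a generated sample only if its geometry agrees with the prompt."""
    return [i for i, (t, s) in enumerate(samples) if cosine(t, s) >= threshold]

good = (np.array([1.0, 0.0]), np.array([0.9, 0.1]))  # text matches shape
bad = (np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # semantic mismatch
assert filter_consistent([good, bad]) == [0]          # noisy sample discarded
```

Filtering before training keeps the expanded dataset from importing the generator's failure cases into the pretraining signal.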
♻ ☆ LayerAnimate: Layer-specific Control for Animation
Animated video separates foreground and background elements into layers, with
distinct processes for sketching, refining, coloring, and in-betweening.
Existing video generation methods typically treat animation as a monolithic
data domain, lacking fine-grained control over individual layers. In this
paper, we introduce LayerAnimate, a novel architectural approach that enhances
fine-grained control over individual animation layers within a video diffusion
model, allowing users to independently manipulate foreground and background
elements in distinct layers. To address the challenge of limited layer-specific
data, we propose a data curation pipeline that features automated element
segmentation, motion-state hierarchical merging, and motion coherence
refinement. Through quantitative and qualitative comparisons and a user study,
we demonstrate that LayerAnimate outperforms current methods in terms of
animation quality, control precision, and usability, making it an ideal tool
for both professional animators and amateur enthusiasts. This framework opens
up new possibilities for layer-specific animation applications and creative
flexibility. Our code is available at https://layeranimate.github.io.
comment: Project page: https://layeranimate.github.io
♻ ☆ Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder
models for accelerating high-resolution diffusion models. Existing autoencoder
models have demonstrated impressive results at a moderate spatial compression
ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for
high spatial compression ratios (e.g., 64x). We address this challenge by
introducing two key techniques: (1) Residual Autoencoding, where we design our
models to learn residuals based on the space-to-channel transformed features to
alleviate the optimization difficulty of high spatial-compression autoencoders;
(2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase
training strategy for mitigating the generalization penalty of high
spatial-compression autoencoders. With these designs, we improve the
autoencoder's spatial compression ratio up to 128x while maintaining
reconstruction quality. Applying our DC-AE to latent diffusion models, we
achieve significant speedup without accuracy drop. For example, on ImageNet
512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup
on H100 GPU for UViT-H while achieving a better FID, compared with the widely
used SD-VAE-f8 autoencoder. Our code is available at
https://github.com/mit-han-lab/efficientvit.
comment: Preprint. First two authors contributed equally to this work. Update:
fix typo
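The Residual Autoencoding idea above can be sketched with a downsampling block whose shortcut is a parameter-free space-to-channel rearrangement (pixel unshuffle), so the learned branch only has to model a residual on top of an exact, invertible compression of spatial resolution into channels. Layer sizes and the single-conv branch are illustrative, not DC-AE's actual architecture.

```python
import torch
import torch.nn as nn

class ResidualDownBlock(nn.Module):
    """Sketch: space-to-channel shortcut + learned residual branch."""
    def __init__(self, in_ch, factor=2):
        super().__init__()
        self.shuffle = nn.PixelUnshuffle(factor)   # lossless space -> channel
        out_ch = in_ch * factor * factor
        # Learned branch matches the shortcut's output shape.
        self.branch = nn.Conv2d(in_ch, out_ch, 3, stride=factor, padding=1)

    def forward(self, x):
        return self.shuffle(x) + self.branch(x)    # shortcut + residual

block = ResidualDownBlock(in_ch=3, factor=2)
x = torch.randn(1, 3, 32, 32)
y = block(x)
assert y.shape == (1, 12, 16, 16)   # 2x spatial compression into channels
```

Because the shortcut already preserves all information, optimization only has to shape the latent, which is what eases training at aggressive compression ratios like 64x-128x.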
♻ ☆ Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific
person images according to the given textual descriptions. A primary challenge
in this task is bridging the substantial representational gap between visual
and textual modalities. Prevailing methods map texts and images into a unified
embedding space for matching, but the intricate semantic correspondences
between texts and images are still not effectively modeled.
To address this issue, we propose a novel TIPR framework to build fine-grained
interactions and alignment between person images and the corresponding texts.
Specifically, a visual-textual dual encoder is first constructed by fine-tuning
the Contrastive Language-Image Pre-training (CLIP) model to preliminarily align
image and text features. Second, a Text-guided Image
Restoration (TIR) auxiliary task is proposed to map abstract textual entities
to specific image regions, improving the alignment between local textual and
visual embeddings. Additionally, a cross-modal triplet loss is presented to
handle hard samples, and further enhance the model's discriminability for minor
differences. Moreover, a pruning-based text data augmentation approach is
proposed to enhance focus on essential elements in descriptions, thereby
avoiding excessive model attention to less significant information. The
experimental results show that our proposed method outperforms state-of-the-art
methods on three popular benchmark datasets, and the code will be made publicly
available at https://github.com/Delong-liu-bupt/SEN.
comment: The paper was withdrawn due to a dispute among the authors regarding
the content of the article
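The cross-modal triplet loss mentioned above, with its focus on hard samples, can be sketched as follows. The margin value and the hardest-negative mining rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim); row i of each is a matched pair.
    Penalize matched pairs that are not separated from the hardest
    non-matching text by at least `margin` in cosine similarity."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()               # (batch, batch) cosine sims
    pos = sim.diag()                          # matched-pair similarity
    eye = torch.eye(len(sim), dtype=torch.bool)
    # Hardest negative per image: most similar non-matching text.
    hard_neg = sim.masked_fill(eye, -1.0).max(dim=1).values
    return F.relu(margin + hard_neg - pos).mean()

img = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
txt = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
assert cross_modal_triplet(img, txt).item() == 0.0  # perfectly aligned pairs
```

Mining the hardest in-batch negative is what sharpens discriminability for the minor appearance differences the abstract highlights.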
♻ ☆ Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM)
designed for generating detailed and accurate video descriptions, while also
exhibiting superior general video understanding capabilities. Tarsier2 achieves
significant advancements through three key upgrades: (1) Scaling pre-training
data from 11M to 40M video-text pairs, enriching both volume and diversity; (2)
Performing fine-grained temporal alignment during supervised fine-tuning; (3)
Using model-based sampling to automatically construct preference data and
applying DPO training for optimization. Extensive experiments show that
Tarsier2-7B consistently outperforms leading proprietary models, including
GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K
benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over
Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6%
performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B
also sets new state-of-the-art results across 15 public benchmarks, spanning
tasks such as video question answering, video grounding, hallucination tests,
and embodied question answering, demonstrating its versatility as a robust
generalist vision-language model.
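The DPO training in upgrade (3) optimizes a preference objective over chosen/rejected description pairs. A minimal per-pair sketch of the standard DPO loss is below; the log-probabilities are illustrative numbers, and how Tarsier2 batches or weights pairs is not specified in the abstract.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO: -log sigmoid of the implied reward margin, where
    each reward is beta * (policy log-prob minus reference log-prob)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen description more than the reference does:
low = dpo_loss(pi_chosen=-5.0, pi_rejected=-9.0, ref_chosen=-6.0, ref_rejected=-8.0)
# Policy favors the rejected description instead:
high = dpo_loss(pi_chosen=-6.0, pi_rejected=-8.0, ref_chosen=-5.0, ref_rejected=-9.0)
assert low < high                   # better preference margin -> lower loss
assert low < math.log(2.0) < high   # log 2 is the zero-margin loss
```

Model-based sampling supplies the chosen/rejected pairs automatically, so no human preference labels are needed to drive this objective.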